---
title: "R for Data Science from Chapter 5"
output: html_notebook
---

1/11/19
Chapter 5: Data Transformation
Setting Up Libraries

```{r}
library(tidyverse)
library(nycflights13)
library(dplyr)
```

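dplyr masks two functions from the stats package, filter() and lag(). To use the base versions after loading dplyr, call them by their full names: stats::filter() and stats::lag().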

```{r}
flights
```

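The abbreviations under the column names give each column's type:

int stands for integers
dbl stands for doubles, or real numbers
chr stands for character vectors, or strings
dttm stands for date-times (a date + a time)
lgl stands for logical, vectors that contain only TRUE or FALSE
fctr stands for factors, which represent categorical data
date stands for dates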


dplyr basics

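pick observations by their values: filter
reorder the rows: arrange
pick variables by their names: select
create new variables from existing variables: mutate
create new variables and keep only those: transmute
collapse many values down to a single summary: summarise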

```{r}
filter(flights, month == 1, day == 1)
```
Here we filter for the flights on January 1.

To assign the result to a variable and print it in one step, wrap the assignment in parentheses.

```{r}
(dec25 <- filter(flights, month == 12, day == 25))
```

Floating-point numbers are stored as approximations, so testing them with == can give surprising results. Use the near() function instead.

```{r}
sqrt(2)^2 == 2
near(sqrt(2)^2, 2)
```

Because the computer stores only an approximation of sqrt(2), squaring it does not give exactly 2, so == returns FALSE while near() returns TRUE.
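
The same effect shows up with ordinary decimals. A minimal sketch, assuming dplyr is loaded; the numbers are made up for illustration, and `tol` is the tolerance argument of near():

```{r}
library(dplyr)

# 0.1 + 0.2 is stored as a value very slightly different from 0.3
0.1 + 0.2 == 0.3
(0.1 + 0.2) - 0.3           # the tiny leftover error

# near() compares within a tolerance instead of requiring exact equality
near(0.1 + 0.2, 0.3)

# The tolerance can be set explicitly through the tol argument
near(1, 1.001, tol = 0.01)
```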

& and: intersection (must be in both x and y)
| or: union (in x, in y, or in both)
! not: complement (not in x)

```{r}
filter(flights, month == 11| month == 12)
```

We can't write month == 11 | 12. Because == binds more tightly than |, R reads it as (month == 11) | 12. The number 12 is nonzero, so it is coerced to TRUE, and anything | TRUE is TRUE. The condition therefore holds for every row, and filter() returns all of the flights rather than only those from November and December.

e.g.
```{r}
filter(flights, month == 11|12)
```
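To see what R does with this expression, it helps to try it on a short vector. A minimal sketch in base R; `month_demo` is a made-up vector, not part of the flights data:

```{r}
# A made-up vector of months, for illustration only
month_demo <- c(1, 11, 12)

# Parsed as (month_demo == 11) | 12; the 12 is coerced to TRUE,
# so every element comes out TRUE
month_demo == 11 | 12

# Each side of | needs its own complete comparison
month_demo == 11 | month_demo == 12
```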
A useful shorthand is x %in% y, which selects every row where x is one of the values in y:
```{r}
filter(flights, month %in% c(11,12))
```
c() combines (concatenates) its arguments into a vector. %in% is a base R operator, not part of dplyr: x %in% y returns TRUE for each element of x that appears in y.
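
A small sketch of %in% on its own, outside of filter(). The vector `months_demo` is made up for illustration; only base R is used:

```{r}
# Made-up months, including a missing value
months_demo <- c(1, 11, 12, NA)

# TRUE wherever the element on the left appears in the vector on the right
in_nov_dec <- months_demo %in% c(11, 12)
in_nov_dec

# Unlike ==, %in% never returns NA: a missing month simply counts as "not in"
months_demo == 11 | months_demo == 12
```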

Missing or unknown values are represented as NA.
Almost any comparison involving NA returns NA, so a test like x == NA cannot find them.
Use the is.na() function instead.
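
A minimal sketch of how NA behaves in comparisons and in filter(), assuming dplyr is loaded. The tibble `na_demo` is made up for illustration:

```{r}
library(dplyr)

# A made-up tibble with one missing value
na_demo <- tibble(x = c(1, NA, 3))

# Comparing with NA gives NA for every element, so this cannot find missing values
na_demo$x == NA

# is.na() reports which values are missing
is.na(na_demo$x)

# filter() drops rows where the condition is NA; ask for them explicitly to keep them
kept_default <- filter(na_demo, x > 1)
kept_with_na <- filter(na_demo, is.na(x) | x > 1)
```

Here kept_default keeps only the row with 3, while kept_with_na also keeps the missing row.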

```{r}
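# is.na() needs a column to test; desc() sorts rows with a missing dep_delay to the top
arrange(flights, desc(is.na(dep_delay)))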
```

Next, pick variables (columns) with select().

```{r}
select(flights,year, month, day)
select(flights,year:day)
select(flights, -(year:day))
```

There are a number of helper functions you can use within select():

    starts_with("abc"): matches names that begin with "abc".

    ends_with("xyz"): matches names that end with "xyz".

    contains("ijk"): matches names that contain "ijk".

    matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You'll learn more about regular expressions in strings.

    num_range("x", 1:3): matches x1, x2 and x3.
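
The helpers listed above can be tried on a tiny table where the result is easy to check by eye. A sketch assuming dplyr is loaded; `helper_demo` and its column names are made up for illustration:

```{r}
library(dplyr)

# A made-up one-row tibble; only the column names matter here
helper_demo <- tibble(x1 = 1, x2 = 2, x3 = 3,
                      dep_delay = 4, arr_delay = 5, tailnum = "N1")

names(select(helper_demo, starts_with("x")))       # x1, x2, x3
names(select(helper_demo, ends_with("delay")))     # dep_delay, arr_delay
names(select(helper_demo, contains("num")))        # tailnum
names(select(helper_demo, matches("(.)\\1")))      # arr_delay (the doubled "r")
names(select(helper_demo, num_range("x", 1:2)))    # x1, x2
```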

```{r}
rename(flights, tail_num = tailnum)
```

There is also the helper everything(), which is useful for moving a few variables to the beginning of the data frame.

```{r}
select(flights, time_hour, air_time, everything())
```

Exercises: naming a variable more than once in select() includes it only once
```{r}
select(flights, time_hour, time_hour)
```
Add new variables with mutate

```{r}
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)

mutate(flights_sml, gain = dep_delay - arr_delay,
       speed = distance / air_time * 60)
```
```{r}
mutate(flights_sml,
       gain = dep_delay - arr_delay,
       hours = air_time / 60,
       gain_per_hour = gain / hours
       )
```

To keep only the new variables, use transmute().
```{r}
transmute(flights, 
          gain = dep_delay - arr_delay,
          hours = air_time / 60,
          gain_per_hour = gain / hours
          )
```

Useful functions: integer division (%/%) and remainder (%%) split dep_time into hours and minutes
```{r}
transmute(flights,
          dep_time,
          hour = dep_time %/% 100,
          minute = dep_time %% 100
          )
```
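The arithmetic in the chunk above is easier to follow on a few hand-picked values. A sketch in base R; the times in `dep_time_demo` are made up and assumed to be in HHMM format, as dep_time is:

```{r}
# Made-up departure times in HHMM format
dep_time_demo <- c(517, 1545, 2400)

hour_demo <- dep_time_demo %/% 100    # integer division keeps the hours: 5, 15, 24
minute_demo <- dep_time_demo %% 100   # remainder keeps the minutes: 17, 45, 0

# Recombine into minutes since midnight: 317, 945, 1440
mins_since_midnight <- hour_demo * 60 + minute_demo
mins_since_midnight
```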
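Logarithms: log(), log2() and log10()

Prefer log2() because it is easy to interpret: a difference of 1 on the log2 scale corresponds to doubling on the original scale, and a difference of -1 corresponds to halving.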
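
A quick check of that interpretation in base R. The vector `doubling_demo` is made up so that each value is twice the previous one:

```{r}
# Each value is double the one before it
doubling_demo <- c(1, 2, 4, 8)

log2(doubling_demo)          # 0, 1, 2, 3
diff(log2(doubling_demo))    # every doubling adds exactly 1

# log10() counts powers of ten instead
log10(c(1, 10, 100))         # 0, 1, 2
```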

Use lead() and lag() to refer to the next and previous values in a vector.

```{r}
x <- 1:10
lag(x)
lead(x)
```
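Two common uses of lag() are running differences and spotting where a value changes. A small sketch assuming dplyr is loaded; `lag_demo` is a made-up vector:

```{r}
library(dplyr)

# Made-up values for illustration
lag_demo <- c(3, 5, 5, 9)

# Running difference from the previous value: NA, 2, 0, 4
lag_demo - lag(lag_demo)

# Where does the value change? NA, TRUE, FALSE, TRUE
lag_demo != lag(lag_demo)
```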
Lastly, summarise() collapses a data frame to a single row.
```{r}
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
```
 
```{r}
by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
```
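The position of na.rm matters: it is an argument of mean(), so it belongs inside mean()'s parentheses. A sketch of the difference on a made-up grouped table (`grp_demo` is hypothetical), assuming dplyr is loaded:

```{r}
library(dplyr)

# A made-up data set with one missing value in group "a"
grp_demo <- tibble(g = c("a", "a", "b"), v = c(1, NA, 3))

# Without na.rm, one NA makes the whole group's mean NA
with_na <- grp_demo %>% group_by(g) %>% summarise(m = mean(v))

# With na.rm = TRUE inside mean(), the NA is dropped before averaging
without_na <- grp_demo %>% group_by(g) %>% summarise(m = mean(v, na.rm = TRUE))

with_na$m       # NA, 3
without_na$m    # 1, 3
```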
```{r}
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest, 
                   count = n(),
                   dist = mean(distance, na.rm = TRUE),
                   delay = mean(arr_delay, na.rm = TRUE)
                   )
delay <- filter(delay, count > 20, dest != "HNL")
ggplot(data = delay, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) + 
  geom_smooth(se = FALSE)
```

Using a pipe
```{r}
delays <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")
ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) + 
  geom_smooth(se = FALSE)
```


Think of the pipe %>% as "and then".
na.rm = TRUE removes missing values before the summary is computed.
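
The pipe passes whatever is on its left as the first argument of the call on its right, so x %>% f(y) is the same as f(x, y). A minimal sketch assuming dplyr is loaded (it re-exports %>%); `pipe_demo` is a made-up vector:

```{r}
library(dplyr)

pipe_demo <- c(1, 4, 9)

# Nested calls read inside out
nested <- round(mean(sqrt(pipe_demo)), 1)

# The piped version reads left to right: take the square roots, and then
# the mean, and then round
piped <- pipe_demo %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```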

Counts 
```{r}
not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(mean = mean(dep_delay))
```

```{r}
delays <-  not_cancelled %>% 
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay)
  )

ggplot(data=delays, mapping = aes(x = delay)) + 
  geom_freqpoly(binwidth = 10)
```
```{r}
delays <- not_cancelled %>%
  group_by(tailnum) %>%
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )
ggplot(data = delays, mapping = aes(x = n, y = delay)) +
  geom_point(alpha = 0.1)
```
```{r}
delays %>%
  filter(n > 25) %>%
  ggplot(mapping = aes(x = n, y = delay)) +
  geom_point(alpha = 0.1)
```

```{r}
library(Lahman)
batting <- as_tibble(Lahman::Batting)

batters <- batting %>%
  group_by(playerID) %>%
  summarise(
    ba = sum(H, na.rm = TRUE) / sum(AB, na.rm = TRUE),
    ab = sum(AB, na.rm = TRUE)
  )

batters %>%
  filter(ab > 100) %>%
  ggplot(mapping = aes(x = ab, y = ba)) +
  geom_point() + 
  geom_smooth(se = FALSE)
```

Chapter 7
Loading Libraries
```{r}
library(tidyverse)
```

Two types of questions are really important to your research:

1. What type of variation occurs within my variables?
2. What type of covariation occurs between my variables?

Variable - a measurable thing
Value - the state of a variable when it is measured
Observation - a set of values measured under similar conditions
Tabular data - observations x variables

Variation - tendency of values to change from measurement to measurement

Visualizing distributions can be used to understand the pattern of variation within the data
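
As a sketch of what that can look like, the chunk below plots one categorical and one continuous variable. It assumes the `diamonds` data set that ships with ggplot2, which these notes have not used so far; the binwidth of 0.5 is an arbitrary choice:

```{r}
library(tidyverse)

# Categorical variable: a bar chart shows how many observations fall in each level
bar_demo <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
bar_demo

# Continuous variable: a histogram bins the values and counts each bin
hist_demo <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
hist_demo

# The bar heights as a table
cut_counts <- count(diamonds, cut)
cut_counts
```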

